Augmented Proximal Policy Optimization for Safe Reinforcement Learning

Authors

Abstract

Safe reinforcement learning considers practical scenarios that maximize the return while satisfying safety constraints. Current algorithms, which suffer from training oscillations or approximation errors, still struggle to update the policy efficiently with precise constraint satisfaction. In this article, we propose Augmented Proximal Policy Optimization (APPO), which augments the Lagrangian function of the primal constrained problem with a quadratic deviation term. The constructed multiplier-penalty function dampens cost oscillation for stable convergence while remaining equivalent to the primal constrained problem, so that costs can be controlled precisely. APPO alternately updates the policy and the multiplier by solving the augmented primal-dual problem, and can be easily implemented with any first-order optimizer. We apply our method to diverse safety-constrained tasks, setting a new state of the art compared with a comprehensive list of safe RL baselines. Extensive experiments verify the merits of our method in easy implementation, stable convergence, and precise cost control.
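As a rough illustration of the alternating update the abstract describes, the sketch below composes a PPO-style policy loss with the textbook augmented-Lagrangian penalty for an inequality constraint. The names (`reward_loss`, `ep_cost`, `cost_limit`, `rho`) are placeholders of ours, and the penalty is the standard form from the optimization literature, not necessarily the paper's exact construction:

```python
import torch

def augmented_policy_loss(reward_loss, ep_cost, cost_limit, lam, rho):
    """Primal objective: PPO surrogate plus a multiplier-penalty term.

    reward_loss : clipped PPO surrogate loss (to be minimized).
    ep_cost     : differentiable estimate of the expected episodic cost.
    The quadratic rho-term is what damps the cost oscillations that
    plain Lagrangian methods tend to exhibit.
    """
    violation = ep_cost - cost_limit
    # Standard augmented-Lagrangian penalty for a g(x) <= 0 constraint:
    # active when lam + rho*violation > 0, constant otherwise.
    penalty = (torch.clamp(lam + rho * violation, min=0.0) ** 2 - lam ** 2) / (2 * rho)
    return reward_loss + penalty

def multiplier_step(ep_cost, cost_limit, lam, rho):
    """Dual update; projection onto lam >= 0 keeps the multiplier feasible."""
    return max(0.0, lam + rho * (ep_cost - cost_limit))
```

Training would alternate a few first-order optimizer steps on `augmented_policy_loss` over the policy parameters with one call to `multiplier_step`, matching the alternating primal-dual scheme the abstract outlines.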


Similar Articles

Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning

Constrained Markov Decision Process (CMDP) is a natural framework for reinforcement learning tasks with safety constraints, where agents learn a policy that maximizes the long-term reward while satisfying the constraints on the long-term cost. A canonical approach for solving CMDPs is the primal-dual method which updates parameters in primal and dual spaces in turn. Existing methods for CMDPs o...
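In standard notation, the CMDP program this snippet refers to can be written as below, with the primal-dual method taking gradient steps on the policy and on the multiplier of the Lagrangian in turn (the symbols for discount, cost, and threshold follow the usual conventions rather than the truncated text):

```latex
\max_{\pi}\ \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right] \le d,
\qquad
L(\pi, \lambda) = J_r(\pi) - \lambda \bigl(J_c(\pi) - d\bigr).
```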


Safe and Efficient Off-Policy Reinforcement Learning

In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of “off-policyness”; and (3) it is efficient as it makes the b...
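For reference, the Retrace(λ) operator, recalled here from the cited paper (Munos et al., 2016) and worth verifying against the original, truncates the importance weights at 1, which is what makes samples from arbitrary behaviour policies safe to use:

```latex
\mathcal{R} Q(x, a) = Q(x, a)
+ \mathbb{E}_{\mu}\!\left[\sum_{t \ge 0} \gamma^{t}
\left(\prod_{s=1}^{t} c_{s}\right)
\bigl(r_{t} + \gamma\, \mathbb{E}_{\pi} Q(x_{t+1}, \cdot) - Q(x_{t}, a_{t})\bigr)\right],
\qquad
c_{s} = \lambda \min\!\left(1, \frac{\pi(a_{s} \mid x_{s})}{\mu(a_{s} \mid x_{s})}\right).
```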


Safe Policy Search for Lifelong Reinforcement Learning with Sublinear Regret

Lifelong reinforcement learning provides a promising framework for developing versatile agents that can accumulate knowledge over a lifetime of experience and rapidly learn new tasks by building upon prior knowledge. However, current lifelong learning methods exhibit non-vanishing regret as the amount of experience increases, and include limitations that can lead to suboptimal or unsafe control...


Safe exploration for reinforcement learning

In this paper we define and address the problem of safe exploration in the context of reinforcement learning. Our notion of safety is concerned with states or transitions that can lead to damage and thus must be avoided. We introduce the concepts of a safety function for determining a state’s safety degree and that of a backup policy that is able to lead the controlled system from a critical st...
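A minimal sketch of how the two concepts this snippet introduces could fit together; `safety`, `explore_policy`, `backup_policy`, and `threshold` are hypothetical names of ours, not the paper's API:

```python
def safe_step(state, explore_policy, backup_policy, safety, threshold=0.0):
    """Gate exploration with a safety function and a backup policy.

    safety(state) returns a scalar safety degree (higher = safer);
    states scoring below `threshold` are treated as critical.
    """
    if safety(state) < threshold:
        # Critical state: hand control to the backup policy, which is
        # meant to lead the system back to a safe region.
        return backup_policy(state)
    # Safe state: ordinary exploratory action selection.
    return explore_policy(state)
```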


Policy Improvement through Safe Reinforcement Learning in High-Risk Tasks

Reinforcement Learning (RL) methods are widely used for dynamic control tasks. In many cases, these are high-risk tasks where the trial-and-error process may select actions whose execution from unsafe states can be catastrophic. In addition, many of these tasks have continuous state and action spaces, making the learning problem harder and unapproachable with conventional RL algorithms. So, whe...



Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2023

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v37i6.25888